{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "350a2d4c",
   "metadata": {},
   "source": [
    "<center><font size = \"6\">南方都市报微信公众号内容</font></center>  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "798e73b6",
   "metadata": {},
   "source": [
    "# 数据挖掘 — 南方都市报微信公众号内容抓取\n",
    "## 项目要求 \n",
    "- 使用selenium进入微信公众平台\n",
    "- 在微信公众平台寻找指定的公众号\n",
    "- 抓取该公众号指定时间区间的文章（不低于50页数据/不低于1年的数据）\n",
    "- 导出文章信息（应包含标题，时间，文章url链接以及文章文本内容）"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f2ec95c",
   "metadata": {},
   "source": [
    "# 准备工作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "9fb86893",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<ipython-input-54-8ae4025e7ff4>:19: DeprecationWarning: use options instead of chrome_options\n",
      "  driver = webdriver.Chrome( chrome_options = opts) #desired_capabilities=caps,\n"
     ]
    }
   ],
   "source": [
    "# 导入所需模块\n",
    "from selenium import webdriver\n",
    "from selenium.webdriver.common.desired_capabilities import DesiredCapabilities\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from lxml.html import fromstring\n",
    "import time\n",
    "from random import random\n",
    "from requests_html import HTMLSession\n",
    "\n",
    "#caps=dict()\n",
    "#caps[\"pageLoadStrategy\"] = \"none\"   # Do not wait for full page load\n",
    "\n",
    "opts = webdriver.ChromeOptions()\n",
    "opts.add_argument('--no-sandbox')#解决DevToolsActivePort文件不存在的报错\n",
    "opts.add_argument('window-size=1920x3000') #指定浏览器分辨率\n",
    "opts.add_argument('--disable-gpu') #谷歌文档提到需要加上一这个属性来规避bug\n",
    "opts.add_argument('--hide-scrollbars') #隐藏滚动条, 应对些特殊页面\n",
    "#opts.add_argument('blink-settings=imagesEnabled=false') #不加载图片, 提升速度\n",
    "#opts.add_argument('--headless') #浏览器不提供可视化页面. linux下如果系统不支持可视化不加这条会启动失败\n",
    "# opts.binary_location = \"C:\\portable\\PortableApps\\IronPortable\\App\\Iron\\chrome.exe\"\n",
    "# opts.binary_location = \"C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe\" #\"H:\\_coding_\\Gitee\\InternetNewMedia\\CapstonePrj2016\\chromedriver.exe\"  \n",
    "\n",
    "\n",
    "driver = webdriver.Chrome( chrome_options = opts) #desired_capabilities=caps,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "53cd94bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 输入“公众号”参数\n",
    "公众号 = \"南方都市报\"\n",
    "# 指定内容输出的位置\n",
    "fn = { \"output\" : { \"公众号_htm_snippets\": \"data_raw_src_/公众号_htm_snippets_{公众号}.tsv\",\n",
    "                    \"公众号_df\": \"data_raw_src_/公众号_df_{公众号}.tsv\",\n",
    "                    \"公众号_xlsx\": \"公众号_url_{公众号}.xlsx\" } \\\n",
    "      }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "1c536e1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 网址信息\n",
    "driver.get(\"https://mp.weixin.qq.com\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1631595a",
   "metadata": {},
   "source": [
    "# 自动化登录 — 需扫码操作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "0cdaff78",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 账号、密码信息\n",
    "payload =  {\"account\": \"请输入您的账号\", \"password\": \"请输入您的密码\"}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "82f8341d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 点击使用账号、密码登录\n",
    "element = driver.find_element_by_xpath('//a[@class=\"login__type__container__select-type\"]')\n",
    "# 不要直接 click（） 等相关操作，首先要检查是否通过 xpath 找的正确的element\n",
    "element.get_attribute('innerHTML')\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "e42a8c80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 输入账号 — clear（）操作清除 以防输入框内有内容\n",
    "element = driver.find_element_by_xpath('//input[@name=\"account\"]')\n",
    "element.get_attribute('innerHTML')\n",
    "element.clear()\n",
    "element.send_keys(payload['account'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "39e6597c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 输入密码\n",
    "element = driver.find_element_by_xpath('//input[@name=\"password\"]')\n",
    "element.get_attribute('innerHTML')\n",
    "element.clear()\n",
    "element.send_keys(payload['password'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "5a2ce283",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 登录跳转\n",
    "element = driver.find_element_by_xpath('//a[@class=\"btn_login\"]')\n",
    "element.get_attribute('innerHTML')\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fa1b10f",
   "metadata": {},
   "source": [
    "# 寻找选单"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "6d0aeae2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 公众号页面左侧展开\n",
    "element = driver.find_element_by_xpath('//a[@id=\"m_open\"]')\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "d9aff39b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 点击图文素材\n",
    "element = driver.find_element_by_xpath('/html/body/div[4]/div[2]/ul/li[2]/ul/li[1]/a') \n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "a62b5447",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 点击 \"+\"  新的创作\n",
    "element = driver.find_element_by_xpath('//i[@class=\"weui-desktop-card__icon-add\"]')\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "10109b77",
   "metadata": {},
   "outputs": [],
   "source": [
    "# “写新图文” —>  链接跳转\n",
    "element = driver.find_element_by_xpath('//a//i[@class=\"icon-svg-editor-appmsg\"]') \n",
    "element.click()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "348794ba",
   "metadata": {},
   "source": [
    "# 窗口信息检查 — 并定位在当前窗口下进行操作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "f15c385b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['CDwindow-793FE245EE061BC5BAEDF9AD83BBA89A',\n",
       " 'CDwindow-FB7D9B1D77DB9F2767BF73B3620E7CC5']"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 两个窗口下 进行窗口定位 \n",
    "# 窗口信息检查（>1）\n",
    "driver.window_handles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "20efda39",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<ipython-input-67-0188c2a7ff70>:2: DeprecationWarning: use driver.switch_to.window instead\n",
      "  driver.switch_to_window(driver.window_handles[1])\n"
     ]
    }
   ],
   "source": [
    "# 窗口切换\n",
    "driver.switch_to_window(driver.window_handles[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "328869a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 点击超链接\n",
    "element = driver.find_element_by_xpath('//li[@id=\"js_editor_insertlink\"]') \n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "3acc56e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 点击选择其它公众号\n",
    "element = driver.find_element_by_xpath('//button[@class=\"weui-desktop-btn weui-desktop-btn_default\"]')\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "56fef0a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# input 输入关键词\n",
    "element = driver.find_element_by_xpath('//input[@placeholder=\"输入文章来源的公众号名称或微信号，回车进行搜索\"]')\n",
    "element.get_attribute('innerHTML')\n",
    "element.clear()\n",
    "element.send_keys(公众号)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "deecd28a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div class=\"weui-desktop-icon weui-desktop-icon__search weui-desktop-icon__small\" style=\"width: 20px; height: 20px;\"><!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!---->     <svg viewBox=\"0 0 24 24\" version=\"1.1\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\"><title>MP/Icon/Search</title> <g id=\"MP/Icon/Search\" stroke=\"none\" stroke-width=\"1\" fill=\"none\" fill-rule=\"evenodd\"><path d=\"M5.78025253,5.78248558 C8.51392257,3.04881554 12.9460774,3.04881554 15.6797475,5.78248558 C18.1730922,8.27583028 18.3922898,12.1821488 16.3373403,14.9239313 L20.6294949,19.2175144 L19.2152814,20.631728 L14.922508,16.3389663 C12.180685,18.394566 8.27384272,18.1755707 5.78025253,15.6819805 C3.04658249,12.9483105 3.04658249,8.51615562 5.78025253,5.78248558 Z M6.8409127,6.84314575 C4.6930291,8.99102935 4.6930291,12.4734367 6.8409127,14.6213203 C8.98879631,16.7692039 12.4712037,16.7692039 14.6190873,14.6213203 C16.7669709,12.4734367 16.7669709,8.99102935 14.6190873,6.84314575 C12.4712037,4.69526215 8.98879631,4.69526215 6.8409127,6.84314575 Z\" id=\"形状\"></path></g></svg> <!----> <!----> <!----> <!----> <!----></div>\n"
     ]
    }
   ],
   "source": [
    "# 点放大镜搜\n",
    "element = driver.find_element_by_xpath('//button[@class=\"weui-desktop-icon-btn weui-desktop-search__btn\"]')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "4ea9018d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/3OEpTPib0kVicrzDWicH7JWJpiacbysXllOEbCSNUJCZxEZEhgP51W1Y8om1ZHyxlfMw7dBSw2IgDWCjshddggeiaFQ/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">南方都市报</strong> <i class=\"inner_link_account_wechat\">微信号：nddaily</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/jribTUbtKkD0sjMf7v2lhz06ZhFcdW38XdraMAfNB49dbqgibCzZx3E6xwDxa1WxaKlaHK3n0yWic6P6P3CibYwVwQ/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">都市报童</strong> <i class=\"inner_link_account_wechat\">微信号：dushibaotong</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/zzoUJzlKT41tz4jfPPzxI6nWsXQdo458HkQRw6RfAWbBIPVT9NcibYTzYGQQn3l6Fs5dLwenuwBMyS03S28rOqg/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">爱南方</strong> <i class=\"inner_link_account_wechat\">微信号：未设置</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">服务号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/zlrHiaWN66fpUy0XLwTteT1ibM39FwW8ylgEFghOuOd4SWwCX3IdDWoVaHaErYbIWU4ic1ibXSI4DQVGqZ9g6pS0Vg/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">正點观影</strong> <i class=\"inner_link_account_wechat\">微信号：nd_ent</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li>\n"
     ]
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//ul[@class=\"inner_link_account_list\"]')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "公众号SERP = main_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "a4da9cd7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 解析\n",
    "import pandas as pd\n",
    "from lxml.html import fromstring\n",
    "root = fromstring(公众号SERP) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "8c120189",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>nickname</th>\n",
       "      <th>wechat</th>\n",
       "      <th>img</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>南方都市报</td>\n",
       "      <td>微信号：nddaily</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/3OEpTPib0kVicrz...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>都市报童</td>\n",
       "      <td>微信号：dushibaotong</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/jribTUbtKkD0sjM...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>爱南方</td>\n",
       "      <td>微信号：未设置</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/zzoUJzlKT41tz4j...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>正點观影</td>\n",
       "      <td>微信号：nd_ent</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/zlrHiaWN66fpUy0...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  nickname            wechat  \\\n",
       "0    南方都市报       微信号：nddaily   \n",
       "1     都市报童  微信号：dushibaotong   \n",
       "2      爱南方           微信号：未设置   \n",
       "3     正點观影        微信号：nd_ent   \n",
       "\n",
       "                                                 img  \n",
       "0  http://mmbiz.qpic.cn/mmbiz_png/3OEpTPib0kVicrz...  \n",
       "1  http://mmbiz.qpic.cn/mmbiz_png/jribTUbtKkD0sjM...  \n",
       "2  http://mmbiz.qpic.cn/mmbiz_png/zzoUJzlKT41tz4j...  \n",
       "3  http://mmbiz.qpic.cn/mmbiz_png/zlrHiaWN66fpUy0...  "
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "主 = root.xpath('//li[@class=\"inner_link_account_item\"]')\n",
    "\n",
    "account_list = []\n",
    "for e in 主:\n",
    "    account_nickname = e.xpath('./div/strong[@class=\"inner_link_account_nickname\"]')[0].text\n",
    "    account_wechat = e.xpath('./div/i[@class=\"inner_link_account_wechat\"]')[0].text\n",
    "    account_img = e.xpath('./div/img/@src')[0]\n",
    "    account = {\"nickname\": account_nickname, \"wechat\": account_wechat, \"img\": account_img,}\n",
    "    account_list.append(account)\n",
    "\n",
    "df_account = pd.DataFrame(account_list)\n",
    "df_account"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6413d164",
   "metadata": {},
   "source": [
    "# 获取公众号文章链接和正文"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "813936b8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/3OEpTPib0kVicrzDWicH7JWJpiacbysXllOEbCSNUJCZxEZEhgP51W1Y8om1ZHyxlfMw7dBSw2IgDWCjshddggeiaFQ/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">南方都市报</strong> <i class=\"inner_link_account_wechat\">微信号：nddaily</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div>\n"
     ]
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//ul[@class=\"inner_link_account_list\"]/li')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "43557553",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n跳转_input = driver.find_element_by_xpath(\\'//span[@class=\"weui-desktop-pagination__form\"]/input\\')\\n跳转_a = driver.find_element_by_xpath(\\'//span[@class=\"weui-desktop-pagination__form\"]/a\\')\\n跳转_title = driver.find_element_by_xpaht(\\'//div[@class=\"inner_link_article_title\"]//span//text()\\')\\n跳转_input.clear()\\n跳转_input.send_keys(2)\\n跳转_a.click()\\n'"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 跳转testing\n",
    "'''\n",
    "跳转_input = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/input')\n",
    "跳转_a = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/a')\n",
    "跳转_title = driver.find_element_by_xpaht('//div[@class=\"inner_link_article_title\"]//span//text()')\n",
    "跳转_input.clear()\n",
    "跳转_input.send_keys(2)\n",
    "跳转_a.click()\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "id": "2601943c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 1333]\n",
      "False\n"
     ]
    }
   ],
   "source": [
    "# 跳转上限\n",
    "l_e = driver.find_elements_by_xpath('//label[@class=\"weui-desktop-pagination__num\"]')\n",
    "l_e_int  = [int(x.text) for x in l_e] \n",
    "print (l_e_int)\n",
    "print (l_e_int[0]==l_e_int[-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "id": "720ed797",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 
421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818, 819, 820, 
821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1150, 1151, 1152, 1153, 1154, 1155, 1156, 1157, 1158, 1159, 1160, 1161, 1162, 1163, 1164, 1165, 1166, 1167, 1168, 1169, 1170, 1171, 1172, 1173, 1174, 1175, 1176, 1177, 1178, 1179, 1180, 1181, 1182, 1183, 
1184, 1185, 1186, 1187, 1188, 1189, 1190, 1191, 1192, 1193, 1194, 1195, 1196, 1197, 1198, 1199, 1200, 1201, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213, 1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, 1229, 1230, 1231, 1232, 1233, 1234, 1235, 1236, 1237, 1238, 1239, 1240, 1241, 1242, 1243, 1244, 1245, 1246, 1247, 1248, 1249, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1259, 1260, 1261, 1262, 1263, 1264, 1265, 1266, 1267, 1268, 1269, 1270, 1271, 1272, 1273, 1274, 1275, 1276, 1277, 1278, 1279, 1280, 1281, 1282, 1283, 1284, 1285, 1286, 1287, 1288, 1289, 1290, 1291, 1292, 1293, 1294, 1295, 1296, 1297, 1298, 1299, 1300, 1301, 1302, 1303, 1304, 1305, 1306, 1307, 1308, 1309, 1310, 1311, 1312, 1313, 1314, 1315, 1316, 1317, 1318, 1319, 1320, 1321, 1322, 1323, 1324, 1325, 1326, 1327, 1328, 1329, 1330, 1331, 1332, 1333]\n"
     ]
    }
   ],
   "source": [
    "pages = list(range(l_e_int[0],l_e_int[-1]+1 ))\n",
    "#print(pages[0:2])\n",
    "pages = list(range(1,l_e_int[-1]+1 ))\n",
    "print(pages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "id": "54555f30",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 获取前 60 页的内容\n",
    "pages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "id": "2abef713",
   "metadata": {},
   "outputs": [],
   "source": [
    "# global varialbes \n",
    "# 循环 遍历\n",
    "html_raw = dict()\n",
    "main_content =\"\"\n",
    "element = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "id": "8a986761",
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_pages (pages):\n",
    "    for p in pages:\n",
    "        print (p,end='\\t')\n",
    "\n",
    "        跳转_input = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/input')\n",
    "        跳转_a = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/a')\n",
    "        跳转_input.clear()\n",
    "        跳转_input.send_keys(p)\n",
    "        跳转_a.click()\n",
    "\n",
    "        time.sleep(45+120*random())\n",
    "\n",
    "        element = driver.find_element_by_xpath('//div[@class=\"inner_link_article_list\"]')\n",
    "        main_content = element.get_attribute('innerHTML')\n",
    "        #print(main_content)\n",
    "        html_raw[p] = main_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "id": "4df23321",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\t13\t14\t15\t16\t17\t18\t19\t20\t21\t22\t23\t24\t25\t26\t27\t28\t29\t30\t31\t32\t33\t34\t35\t36\t37\t38\t39\t40\t41\t42\t43\t44\t45\t46\t47\t48\t49\t50\t51\t52\t53\t54\t55\t56\t57\t58\t59\t60\t"
     ]
    }
   ],
   "source": [
    "process_pages (pages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "id": "bd3e44c5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>html_snippets</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>&lt;div class=\"weui-desktop-radio-group\"&gt;&lt;label c...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                       html_snippets\n",
       "1  <div class=\"weui-desktop-radio-group\"><label c..."
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame([html_raw]).T\n",
    "df.columns = [\"html_snippets\"]\n",
    "df.loc[0:1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "id": "b21f72df",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Stored 'html_raw' (dict)\n"
     ]
    }
   ],
   "source": [
    "%store html_raw\n",
    "import pickle \n",
    "filehandler = open(\"html_raw\", 'wb') \n",
    "pickle.dump(html_raw, filehandler)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "id": "acfd3da3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "59\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>html_snippets</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>&lt;div class=\"weui-desktop-radio-group\"&gt;&lt;label c...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                        html_snippets\n",
       "12  <div class=\"weui-desktop-radio-group\"><label c..."
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# df.duplicated()  默认所有列，无重复记录  【duplicated()函数】判断是否有重复项\n",
    "df_out = df[~df.duplicated()]\n",
    "print (len(df_out))\n",
    "df[df.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "id": "3754e52e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[12]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[12]"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "try_again = list(df[df.duplicated()].index)\n",
    "print(try_again)\n",
    "try_again = try_again + list (set(pages).difference(set(df.index.values)))\n",
    "try_again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "id": "8309ee19",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 暂存档\n",
    "filename = fn [\"output\"] [\"公众号_htm_snippets\"] \n",
    "df_out.to_csv(filename.format(公众号=公众号), sep=\"\\t\", encoding=\"utf8\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "c5770849",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "40,40,40,40,40,40,40,40,40,39,39,40,40,40,40,40,40,40,39,40,40,40,40,40,40,40,40,40,39,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,39,39,40,39,39,40,40,40,40,39,40,30,30,"
     ]
    }
   ],
   "source": [
    "def get_content(link):\n",
    "    session = HTMLSession()\n",
    "    r = session.get(url=link)\n",
    "    content_xpath_1 = '//*[@id=\"js_content\"]//span/text()'\n",
    "    content_xpath_2 = '//*[@id=\"js_content\"]//p/text()'\n",
    "    content_1 = ''.join(r.html.xpath(content_xpath_1))\n",
    "    content_2 = ''.join(r.html.xpath(content_xpath_2))\n",
    "    return content_1 + content_2\n",
    "\n",
    "def parse_html_snippets(_snippet_):\n",
    "    root = fromstring(_snippet_) \n",
    "    title = [x.text for x in root.xpath('//div[@class=\"inner_link_article_title\"]//span[2]')]\n",
    "    create_time = [x.text for x in root.xpath('//div[@class=\"inner_link_article_date\"]')]\n",
    "    link = [x for x in root.xpath('//a/@href')]\n",
    "    content_text = [get_content(x) for x in link]\n",
    "    _df_ = pd.DataFrame({\"title\":title, \"create_time\": create_time, \"link\":link, \"content_text\":content_text})\n",
    "    return(_df_)\n",
    "    \n",
    "l_df = []\n",
    "for p in pages:\n",
    "    _df_ = parse_html_snippets(df.loc[p,\"html_snippets\"])\n",
    "    print (len(_df_), end=\",\")\n",
    "    l_df.append(_df_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "id": "281be3ee",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "      <th>content_text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>最新！茂名新增1例无症状感染者，系广州荔湾茶楼服务员</td>\n",
       "      <td>2021-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>据“茂名发布”今天（5月25日）中午13时06分消息：5月25日，广东省茂名市在对广州新冠肺...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>广西一男子核酸检测阳性，曾在广州荔湾隔离14天</td>\n",
       "      <td>2021-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>据广西南宁市卫生健康委员会今天（5月25日）通报，2021年5月24日，南宁市根据广州市荔湾...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>人人人人人……最近广州多地火爆，突破1000万了</td>\n",
       "      <td>2021-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>据国家卫生健康委官网消息截至2021年5月23日31个省（区、市）及新疆生产建设兵团累计报告...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>这种玩具走红，不少人沉迷！所有人注意，千万别买</td>\n",
       "      <td>2021-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>如今孩子们的玩具种类可以说是五花八门近年，一种叫“假水”的玩具在中小学生之间迅速走红▼甚至不...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>广州一高架桥桥墩出现沉降，已封闭</td>\n",
       "      <td>2021-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>南都记者获悉，2021年5月24日，第三方检测单位在广州广园路下塘西立交桥桥墩开展日常检测时...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2366</th>\n",
       "      <td>终于，人设崩塌！</td>\n",
       "      <td>2021-02-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>去年立的Flag都实现了吗？牛年又想立下什么Flag呢？新年的好意头都在福盒里啦点击下图问号...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2367</th>\n",
       "      <td>“中国最后一个原始部落”突发火灾</td>\n",
       "      <td>2021-02-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>据云南沧源佤族自治县人民政府新闻办公室消息，2月14日17时40分，该县勐角民族乡翁丁村老寨...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2368</th>\n",
       "      <td>夫妻要求医院返还冷冻胚胎，判了</td>\n",
       "      <td>2021-02-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>人体胚胎含有DNA遗传物质具有生命属性，不是民法上的一般物，在医疗服务合同目的落空后，当事人...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2369</th>\n",
       "      <td>广东这个地方，美呆了！</td>\n",
       "      <td>2021-02-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>由于天气持续温暖，今年入春不久的潮州，迎来了韩江两岸木棉花的绽放。繁花满树，艳丽夺目，在古城...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2370</th>\n",
       "      <td>东莞一女子误将1.1万现金当垃圾扔了，结果……</td>\n",
       "      <td>2021-02-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>2月12日（大年初一），东莞市塘厦镇市民陈女士在垃圾填埋场，紧握着几名环卫工人的手，连声说着...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2371 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                           title create_time  \\\n",
       "0     最新！茂名新增1例无症状感染者，系广州荔湾茶楼服务员  2021-05-25   \n",
       "1        广西一男子核酸检测阳性，曾在广州荔湾隔离14天  2021-05-25   \n",
       "2       人人人人人……最近广州多地火爆，突破1000万了  2021-05-25   \n",
       "3        这种玩具走红，不少人沉迷！所有人注意，千万别买  2021-05-25   \n",
       "4               广州一高架桥桥墩出现沉降，已封闭  2021-05-25   \n",
       "...                          ...         ...   \n",
       "2366                    终于，人设崩塌！  2021-02-15   \n",
       "2367            “中国最后一个原始部落”突发火灾  2021-02-15   \n",
       "2368             夫妻要求医院返还冷冻胚胎，判了  2021-02-15   \n",
       "2369                 广东这个地方，美呆了！  2021-02-15   \n",
       "2370     东莞一女子误将1.1万现金当垃圾扔了，结果……  2021-02-15   \n",
       "\n",
       "                                                   link  \\\n",
       "0     http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "1     http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "2     http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "3     http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "4     http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "...                                                 ...   \n",
       "2366  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "2367  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "2368  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "2369  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "2370  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "\n",
       "                                           content_text  \n",
       "0     据“茂名发布”今天（5月25日）中午13时06分消息：5月25日，广东省茂名市在对广州新冠肺...  \n",
       "1     据广西南宁市卫生健康委员会今天（5月25日）通报，2021年5月24日，南宁市根据广州市荔湾...  \n",
       "2     据国家卫生健康委官网消息截至2021年5月23日31个省（区、市）及新疆生产建设兵团累计报告...  \n",
       "3     如今孩子们的玩具种类可以说是五花八门近年，一种叫“假水”的玩具在中小学生之间迅速走红▼甚至不...  \n",
       "4     南都记者获悉，2021年5月24日，第三方检测单位在广州广园路下塘西立交桥桥墩开展日常检测时...  \n",
       "...                                                 ...  \n",
       "2366  去年立的Flag都实现了吗？牛年又想立下什么Flag呢？新年的好意头都在福盒里啦点击下图问号...  \n",
       "2367  据云南沧源佤族自治县人民政府新闻办公室消息，2月14日17时40分，该县勐角民族乡翁丁村老寨...  \n",
       "2368  人体胚胎含有DNA遗传物质具有生命属性，不是民法上的一般物，在医疗服务合同目的落空后，当事人...  \n",
       "2369  由于天气持续温暖，今年入春不久的潮州，迎来了韩江两岸木棉花的绽放。繁花满树，艳丽夺目，在古城...  \n",
       "2370  2月12日（大年初一），东莞市塘厦镇市民陈女士在垃圾填埋场，紧握着几名环卫工人的手，连声说着...  \n",
       "\n",
       "[2371 rows x 4 columns]"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_url_out = pd.concat(l_df).reset_index(drop=True)\n",
    "df_url_out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "82a49fab",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ==&mid=2651043870&idx=1&sn=b546f644439f3af423623db0cfd081af&chksm=4794bbf070e332e639d8ba89a95c13c905044bead4092cf86a01b20bd5762a4489a57509a5fb#rd'"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 试验公众号文章链接是否正确\n",
    "df_url_out.loc[0].link"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "id": "32c19d3e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "      <th>content_text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>438</th>\n",
       "      <td>蒙牛道歉了</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>继昨天（5月6日）深夜爱奇艺道歉后，今早（5月7日）9时许，《青春有你3》赞助商、“倒奶事件...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>439</th>\n",
       "      <td>非法穿越，噩耗传来</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>3日，有网友爆料，在陕西宝鸡太白县秦岭，有驴友非法穿越“鳌太线”，多人失联。昨日（6日）上午...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>440</th>\n",
       "      <td>选手被斥“你算什么东西”，最新回应</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>日前，第十三届中国音乐金钟奖古筝比赛贵州选拔赛现场，有选手质疑评审公正性被怼“你算什么东西”...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>441</th>\n",
       "      <td>盖茨妻子聘顶级律师！他给特朗普两妻子打过离婚官司</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>微软公司创始人与妻子、慈善家日前宣布将离婚，结束长达27年的婚姻（）。两人的1300亿美元资...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>442</th>\n",
       "      <td>珠峰大本营至少20人严重咳嗽，尼泊尔现“双突变”病毒病例</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>尼泊尔新冠肺炎确诊病例激增、多名登山者出现严重咳嗽症状持续引发关注。5月6日，据央视新闻消息...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>443</th>\n",
       "      <td>不喊英语，改粤语了！</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>2021年4月15日，香港警察学院操场。当香港警察仪仗队肩扛的旗帜顶端出现在检阅场入口时，安...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>444</th>\n",
       "      <td>82岁“高知老人”称被骗走12万，养老院回应</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>前晚（5月5日），一则“82岁北京高知退休老人称被养老院骗走12万”的消息，引发关注。昨天（...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>445</th>\n",
       "      <td>超酷！潮汕00后男生制作纸片定格动画，走红网络</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>近日，广东揭阳00后男生制作的纸片定格动画《蓝鲸》在网上走红。苏梓凡是来自湖北美术学院的大二...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>446</th>\n",
       "      <td>爱奇艺深夜道歉</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>昨天（5月6日）深夜23时46分，@爱奇艺 发微博称“我们真诚地道歉！”：我们听到了用户及媒...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>447</th>\n",
       "      <td>香港女歌手去世！仅31岁，患罕见癌症</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>5月4日，香港娱乐圈再次传出噩耗。据媒体报道，有着“抗癌歌手”之称的31岁女星（Sarena...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>448</th>\n",
       "      <td>红衣女子阳台外跳舞坠楼，警方通报</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>昨天（5月6日），海南三亚一“红衣女子在高楼阳台外跳舞随后坠楼”的视频引发关注。5月6日18...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>449</th>\n",
       "      <td>高端留学机构跑路！“学霸男神”CEO突然住院</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>近日，南都记者接到来自广州、深圳以及上海等地近20位家长投诉，反映其在高端出国留学机构“藤门...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>450</th>\n",
       "      <td>夏天穿这4件小吊带，清凉不显胖！还巨优雅</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>穿上身，肩带轻轻环绕，勾勒出美丽的肩颈线条，飘然垂坠，凉快又优雅。很多人觉得“性感”=吊带=...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>451</th>\n",
       "      <td>中方严正回应</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>“外交部发言人办公室”消息，在5月6日外交部例行记者会上，总台央视记者提问：5日，七国集团外...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>452</th>\n",
       "      <td>又来！游乐设备故障，游客被悬半空！市监总局发声</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>河北承德市某游乐场一台游乐设施“高空飞翔”5月2日发生故障、致27人受困后，5月4日，国家市...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>453</th>\n",
       "      <td>“累吐了”！为救200斤男子，他飞奔跳入湖中</td>\n",
       "      <td>2021-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>5月3日，江苏淮安，酒后在洪泽湖边散步，在湖边洗脸时不慎落入水中。岸边的家人都不熟悉水性，便...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>454</th>\n",
       "      <td>暴雨后这些小飞虫又成群出现！专家提醒：记得关灯！</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>这个季节的广东人真的太！难！了！前几天的暴雨后它们 又来了！就是飞！蚁！（也叫每次大雨过后都...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>455</th>\n",
       "      <td>老师揪小学生头发，致皮骨分离！已刑拘</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>因上课时“讲小话”，9岁的被老师拖拽头发到讲台罚站，事后孩子头部异常肿胀，被查出头皮头骨分离...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>456</th>\n",
       "      <td>氧气短缺、全面停航！又一国疫情急剧恶化</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>“局势已经失去控制，我们正处于无助的境地”，这是近日一位尼泊尔的医生就当地新冠肺炎疫情发出的...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>457</th>\n",
       "      <td>穗康码冲上热搜！网友都在晒图</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>近日，山东将已经完成疫苗接种人群的健康码，升级为金色健康码。加了镶金的“金钟罩”让全国网友变...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>458</th>\n",
       "      <td>如果你没空读书，就一定要来看看这11个公众号</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>♥长按，选择订阅▼ID：mrxdsfz推荐理由： 你是不是总因为“没时间而放弃阅读”，现在天...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>459</th>\n",
       "      <td>凤凰金融被查，控制人被抓</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>昨晚（5月5日），海南海口公安官方发布警情通报称，2021年4月30日，海口市公安局龙华分局...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>460</th>\n",
       "      <td>世界最大性侵儿童网站被捣毁！曾有超40万会员</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>德国警方5月3日捣毁了，抓获4名被控经营该网站的4名嫌疑人。据悉，该网站自2019年7月成立...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>461</th>\n",
       "      <td>华夏银行交出2020社会责任“成绩单”</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>近日，华夏银行正式发布《华夏银行股份有限公司2020年社会责任报告》。这是该行自2009年建...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>462</th>\n",
       "      <td>有主张，有担当！</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>“少年不负勇往，热爱正当时”。伴随着这句话，从一个个优秀青年的胸腔中迸发而出，视频《新少年说...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>463</th>\n",
       "      <td>惠州一镇最新通报</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>五一假期期间，有广东惠州市民收到这样一则信息，“5月2日晚接到上级核查通知，我镇荷树塘村三坑...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>464</th>\n",
       "      <td>“肿瘤治疗黑幕”相关录音曝光！医生曾说……</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>“肿瘤治疗黑幕”风波中提供“NK细胞免疫治疗”的机构，藏身上海长宁区一个企业广场的写字楼中，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>465</th>\n",
       "      <td>国家发改委重磅声明</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>今早（5月6日），国家发展改革委发布“关于无限期暂停中澳战略经济对话机制下一切活动的声明”：...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>466</th>\n",
       "      <td>高架坍塌，地铁坠落！已致25死</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>据央视新闻报道，当地时间3日晚上，墨西哥首都墨西哥城东南部一段高架轨道发生坍塌，经行列车的多...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>467</th>\n",
       "      <td>今早，黄之锋再获刑</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>今天（5月6日）上午，乱港分子等四人承认2020年一起非法集结控罪，四人在香港区域法院分别被...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>468</th>\n",
       "      <td>五一这群俊男靓女嗨翻了！结束后留一片狼藉</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>5月3日，江苏南京，草莓音乐节演出结束以后，现场垃圾遍布，工作人员冒雨捡垃圾。有工作人员说，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>469</th>\n",
       "      <td>破纪录！破纪录！五一档票房16亿+，但多部电影口碑大翻车</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>继史上最强清明档后，中国电影又迎来了史上最“挤”五一档。13部电影扎根“五一档”同台竞技，数...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>470</th>\n",
       "      <td>“肿瘤治疗黑幕”发帖医生再发声：忍无可忍！</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>昨天（5月5日），“北医三院肿瘤内科医生反映肿瘤治疗黑幕”事件当事人再度发声。在半个月的沉默...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>471</th>\n",
       "      <td>4人窒息死亡！官方通报</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>5月4日，广东省应急管理厅通报了一起“五一”期间有限空间作业致4名人员死亡的较大事故。5月1...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>472</th>\n",
       "      <td>饭馆老板泼热油致多人受伤，警方介入</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>5月1日22时许，在湖南衡阳雁峰区经营餐馆的彭先生被相邻饭店的老板泼热油，造成多人受伤。彭先...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>473</th>\n",
       "      <td>25岁产妇诞下九胞胎，网友：小说成真了</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>据新华国际报道，西非国家马里一名25岁产妇4日在摩洛哥一家医院产下九胞胎，比产前超声波检测到...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>474</th>\n",
       "      <td>能扛海边暴晒，能防电脑辐射，这才叫「真正的防晒衣」！</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>（UVA、UVB），这样的颜值，在也毫不逊色哇，顺便还能保护肌肤不受蓝光伤害，一举两得。更多...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>475</th>\n",
       "      <td>“错换人生28年”起诉涉事医院案将开庭</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>昨天（5月5日），南都记者从“错换人生28年”当事人养母处获悉，其与家人起诉河南大学淮河医院...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>476</th>\n",
       "      <td>“抹黑网约车司机”，短视频博主道歉</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>“因为我的投诉，滴滴司机被罚了5000块。”近日，几个内容高度相似的视频在短视频平台上传播，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>477</th>\n",
       "      <td>情侣沙滩边热吻，保安狂喊：别亲了！</td>\n",
       "      <td>2021-05-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>5月3日，上海渔人码头一人工沙滩海水涨潮，一对情侣在水中热吻忘记上岸，保安大喊别亲了，涨潮了...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            title create_time  \\\n",
       "438                         蒙牛道歉了  2021-05-07   \n",
       "439                     非法穿越，噩耗传来  2021-05-07   \n",
       "440             选手被斥“你算什么东西”，最新回应  2021-05-07   \n",
       "441      盖茨妻子聘顶级律师！他给特朗普两妻子打过离婚官司  2021-05-07   \n",
       "442  珠峰大本营至少20人严重咳嗽，尼泊尔现“双突变”病毒病例  2021-05-07   \n",
       "443                    不喊英语，改粤语了！  2021-05-07   \n",
       "444        82岁“高知老人”称被骗走12万，养老院回应  2021-05-07   \n",
       "445       超酷！潮汕00后男生制作纸片定格动画，走红网络  2021-05-07   \n",
       "446                       爱奇艺深夜道歉  2021-05-07   \n",
       "447            香港女歌手去世！仅31岁，患罕见癌症  2021-05-07   \n",
       "448              红衣女子阳台外跳舞坠楼，警方通报  2021-05-07   \n",
       "449        高端留学机构跑路！“学霸男神”CEO突然住院  2021-05-07   \n",
       "450          夏天穿这4件小吊带，清凉不显胖！还巨优雅  2021-05-07   \n",
       "451                        中方严正回应  2021-05-07   \n",
       "452       又来！游乐设备故障，游客被悬半空！市监总局发声  2021-05-07   \n",
       "453        “累吐了”！为救200斤男子，他飞奔跳入湖中  2021-05-07   \n",
       "454      暴雨后这些小飞虫又成群出现！专家提醒：记得关灯！  2021-05-06   \n",
       "455            老师揪小学生头发，致皮骨分离！已刑拘  2021-05-06   \n",
       "456           氧气短缺、全面停航！又一国疫情急剧恶化  2021-05-06   \n",
       "457                穗康码冲上热搜！网友都在晒图  2021-05-06   \n",
       "458        如果你没空读书，就一定要来看看这11个公众号  2021-05-06   \n",
       "459                  凤凰金融被查，控制人被抓  2021-05-06   \n",
       "460        世界最大性侵儿童网站被捣毁！曾有超40万会员  2021-05-06   \n",
       "461           华夏银行交出2020社会责任“成绩单”  2021-05-06   \n",
       "462                      有主张，有担当！  2021-05-06   \n",
       "463                      惠州一镇最新通报  2021-05-06   \n",
       "464         “肿瘤治疗黑幕”相关录音曝光！医生曾说……  2021-05-06   \n",
       "465                     国家发改委重磅声明  2021-05-06   \n",
       "466               高架坍塌，地铁坠落！已致25死  2021-05-06   \n",
       "467                     今早，黄之锋再获刑  2021-05-06   \n",
       "468          五一这群俊男靓女嗨翻了！结束后留一片狼藉  2021-05-06   \n",
       "469  破纪录！破纪录！五一档票房16亿+，但多部电影口碑大翻车  2021-05-06   \n",
       "470         “肿瘤治疗黑幕”发帖医生再发声：忍无可忍！  2021-05-06   \n",
       "471                   4人窒息死亡！官方通报  2021-05-06   \n",
       "472             饭馆老板泼热油致多人受伤，警方介入  2021-05-06   \n",
       "473           25岁产妇诞下九胞胎，网友：小说成真了  2021-05-06   \n",
       "474    能扛海边暴晒，能防电脑辐射，这才叫「真正的防晒衣」！  2021-05-06   \n",
       "475           “错换人生28年”起诉涉事医院案将开庭  2021-05-06   \n",
       "476             “抹黑网约车司机”，短视频博主道歉  2021-05-06   \n",
       "477             情侣沙滩边热吻，保安狂喊：别亲了！  2021-05-06   \n",
       "\n",
       "                                                  link  \\\n",
       "438  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "439  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "440  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "441  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "442  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "443  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "444  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "445  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "446  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "447  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "448  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "449  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "450  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "451  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "452  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "453  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "454  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "455  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "456  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "457  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "458  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "459  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "460  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "461  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "462  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "463  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "464  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "465  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "466  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "467  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "468  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "469  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "470  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "471  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "472  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "473  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "474  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "475  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "476  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "477  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "\n",
       "                                          content_text  \n",
       "438  继昨天（5月6日）深夜爱奇艺道歉后，今早（5月7日）9时许，《青春有你3》赞助商、“倒奶事件...  \n",
       "439  3日，有网友爆料，在陕西宝鸡太白县秦岭，有驴友非法穿越“鳌太线”，多人失联。昨日（6日）上午...  \n",
       "440  日前，第十三届中国音乐金钟奖古筝比赛贵州选拔赛现场，有选手质疑评审公正性被怼“你算什么东西”...  \n",
       "441  微软公司创始人与妻子、慈善家日前宣布将离婚，结束长达27年的婚姻（）。两人的1300亿美元资...  \n",
       "442  尼泊尔新冠肺炎确诊病例激增、多名登山者出现严重咳嗽症状持续引发关注。5月6日，据央视新闻消息...  \n",
       "443  2021年4月15日，香港警察学院操场。当香港警察仪仗队肩扛的旗帜顶端出现在检阅场入口时，安...  \n",
       "444  前晚（5月5日），一则“82岁北京高知退休老人称被养老院骗走12万”的消息，引发关注。昨天（...  \n",
       "445  近日，广东揭阳00后男生制作的纸片定格动画《蓝鲸》在网上走红。苏梓凡是来自湖北美术学院的大二...  \n",
       "446  昨天（5月6日）深夜23时46分，@爱奇艺 发微博称“我们真诚地道歉！”：我们听到了用户及媒...  \n",
       "447  5月4日，香港娱乐圈再次传出噩耗。据媒体报道，有着“抗癌歌手”之称的31岁女星（Sarena...  \n",
       "448  昨天（5月6日），海南三亚一“红衣女子在高楼阳台外跳舞随后坠楼”的视频引发关注。5月6日18...  \n",
       "449  近日，南都记者接到来自广州、深圳以及上海等地近20位家长投诉，反映其在高端出国留学机构“藤门...  \n",
       "450  穿上身，肩带轻轻环绕，勾勒出美丽的肩颈线条，飘然垂坠，凉快又优雅。很多人觉得“性感”=吊带=...  \n",
       "451  “外交部发言人办公室”消息，在5月6日外交部例行记者会上，总台央视记者提问：5日，七国集团外...  \n",
       "452  河北承德市某游乐场一台游乐设施“高空飞翔”5月2日发生故障、致27人受困后，5月4日，国家市...  \n",
       "453  5月3日，江苏淮安，酒后在洪泽湖边散步，在湖边洗脸时不慎落入水中。岸边的家人都不熟悉水性，便...  \n",
       "454  这个季节的广东人真的太！难！了！前几天的暴雨后它们 又来了！就是飞！蚁！（也叫每次大雨过后都...  \n",
       "455  因上课时“讲小话”，9岁的被老师拖拽头发到讲台罚站，事后孩子头部异常肿胀，被查出头皮头骨分离...  \n",
       "456  “局势已经失去控制，我们正处于无助的境地”，这是近日一位尼泊尔的医生就当地新冠肺炎疫情发出的...  \n",
       "457  近日，山东将已经完成疫苗接种人群的健康码，升级为金色健康码。加了镶金的“金钟罩”让全国网友变...  \n",
       "458  ♥长按，选择订阅▼ID：mrxdsfz推荐理由： 你是不是总因为“没时间而放弃阅读”，现在天...  \n",
       "459  昨晚（5月5日），海南海口公安官方发布警情通报称，2021年4月30日，海口市公安局龙华分局...  \n",
       "460  德国警方5月3日捣毁了，抓获4名被控经营该网站的4名嫌疑人。据悉，该网站自2019年7月成立...  \n",
       "461  近日，华夏银行正式发布《华夏银行股份有限公司2020年社会责任报告》。这是该行自2009年建...  \n",
       "462  “少年不负勇往，热爱正当时”。伴随着这句话，从一个个优秀青年的胸腔中迸发而出，视频《新少年说...  \n",
       "463  五一假期期间，有广东惠州市民收到这样一则信息，“5月2日晚接到上级核查通知，我镇荷树塘村三坑...  \n",
       "464  “肿瘤治疗黑幕”风波中提供“NK细胞免疫治疗”的机构，藏身上海长宁区一个企业广场的写字楼中，...  \n",
       "465  今早（5月6日），国家发展改革委发布“关于无限期暂停中澳战略经济对话机制下一切活动的声明”：...  \n",
       "466  据央视新闻报道，当地时间3日晚上，墨西哥首都墨西哥城东南部一段高架轨道发生坍塌，经行列车的多...  \n",
       "467  今天（5月6日）上午，乱港分子等四人承认2020年一起非法集结控罪，四人在香港区域法院分别被...  \n",
       "468  5月3日，江苏南京，草莓音乐节演出结束以后，现场垃圾遍布，工作人员冒雨捡垃圾。有工作人员说，...  \n",
       "469  继史上最强清明档后，中国电影又迎来了史上最“挤”五一档。13部电影扎根“五一档”同台竞技，数...  \n",
       "470  昨天（5月5日），“北医三院肿瘤内科医生反映肿瘤治疗黑幕”事件当事人再度发声。在半个月的沉默...  \n",
       "471  5月4日，广东省应急管理厅通报了一起“五一”期间有限空间作业致4名人员死亡的较大事故。5月1...  \n",
       "472  5月1日22时许，在湖南衡阳雁峰区经营餐馆的彭先生被相邻饭店的老板泼热油，造成多人受伤。彭先...  \n",
       "473  据新华国际报道，西非国家马里一名25岁产妇4日在摩洛哥一家医院产下九胞胎，比产前超声波检测到...  \n",
       "474  （UVA、UVB），这样的颜值，在也毫不逊色哇，顺便还能保护肌肤不受蓝光伤害，一举两得。更多...  \n",
       "475  昨天（5月5日），南都记者从“错换人生28年”当事人养母处获悉，其与家人起诉河南大学淮河医院...  \n",
       "476  “因为我的投诉，滴滴司机被罚了5000块。”近日，几个内容高度相似的视频在短视频平台上传播，...  \n",
       "477  5月3日，上海渔人码头一人工沙滩海水涨潮，一对情侣在水中热吻忘记上岸，保安大喊别亲了，涨潮了...  "
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 找出重复项\n",
    "df_url_out[df_url_out.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "ae02568f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "      <th>content_text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>最新！茂名新增1例无症状感染者，系广州荔湾茶楼服务员</td>\n",
       "      <td>2021-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>据“茂名发布”今天（5月25日）中午13时06分消息：5月25日，广东省茂名市在对广州新冠肺...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>广西一男子核酸检测阳性，曾在广州荔湾隔离14天</td>\n",
       "      <td>2021-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>据广西南宁市卫生健康委员会今天（5月25日）通报，2021年5月24日，南宁市根据广州市荔湾...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>人人人人人……最近广州多地火爆，突破1000万了</td>\n",
       "      <td>2021-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>据国家卫生健康委官网消息截至2021年5月23日31个省（区、市）及新疆生产建设兵团累计报告...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>这种玩具走红，不少人沉迷！所有人注意，千万别买</td>\n",
       "      <td>2021-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>如今孩子们的玩具种类可以说是五花八门近年，一种叫“假水”的玩具在中小学生之间迅速走红▼甚至不...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>广州一高架桥桥墩出现沉降，已封闭</td>\n",
       "      <td>2021-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>南都记者获悉，2021年5月24日，第三方检测单位在广州广园路下塘西立交桥桥墩开展日常检测时...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2366</th>\n",
       "      <td>终于，人设崩塌！</td>\n",
       "      <td>2021-02-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>去年立的Flag都实现了吗？牛年又想立下什么Flag呢？新年的好意头都在福盒里啦点击下图问号...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2367</th>\n",
       "      <td>“中国最后一个原始部落”突发火灾</td>\n",
       "      <td>2021-02-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>据云南沧源佤族自治县人民政府新闻办公室消息，2月14日17时40分，该县勐角民族乡翁丁村老寨...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2368</th>\n",
       "      <td>夫妻要求医院返还冷冻胚胎，判了</td>\n",
       "      <td>2021-02-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>人体胚胎含有DNA遗传物质具有生命属性，不是民法上的一般物，在医疗服务合同目的落空后，当事人...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2369</th>\n",
       "      <td>广东这个地方，美呆了！</td>\n",
       "      <td>2021-02-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>由于天气持续温暖，今年入春不久的潮州，迎来了韩江两岸木棉花的绽放。繁花满树，艳丽夺目，在古城...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2370</th>\n",
       "      <td>东莞一女子误将1.1万现金当垃圾扔了，结果……</td>\n",
       "      <td>2021-02-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...</td>\n",
       "      <td>2月12日（大年初一），东莞市塘厦镇市民陈女士在垃圾填埋场，紧握着几名环卫工人的手，连声说着...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2331 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                           title create_time  \\\n",
       "0     最新！茂名新增1例无症状感染者，系广州荔湾茶楼服务员  2021-05-25   \n",
       "1        广西一男子核酸检测阳性，曾在广州荔湾隔离14天  2021-05-25   \n",
       "2       人人人人人……最近广州多地火爆，突破1000万了  2021-05-25   \n",
       "3        这种玩具走红，不少人沉迷！所有人注意，千万别买  2021-05-25   \n",
       "4               广州一高架桥桥墩出现沉降，已封闭  2021-05-25   \n",
       "...                          ...         ...   \n",
       "2366                    终于，人设崩塌！  2021-02-15   \n",
       "2367            “中国最后一个原始部落”突发火灾  2021-02-15   \n",
       "2368             夫妻要求医院返还冷冻胚胎，判了  2021-02-15   \n",
       "2369                 广东这个地方，美呆了！  2021-02-15   \n",
       "2370     东莞一女子误将1.1万现金当垃圾扔了，结果……  2021-02-15   \n",
       "\n",
       "                                                   link  \\\n",
       "0     http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "1     http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "2     http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "3     http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "4     http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "...                                                 ...   \n",
       "2366  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "2367  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "2368  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "2369  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "2370  http://mp.weixin.qq.com/s?__biz=MTk1MjIwODAwMQ...   \n",
       "\n",
       "                                           content_text  \n",
       "0     据“茂名发布”今天（5月25日）中午13时06分消息：5月25日，广东省茂名市在对广州新冠肺...  \n",
       "1     据广西南宁市卫生健康委员会今天（5月25日）通报，2021年5月24日，南宁市根据广州市荔湾...  \n",
       "2     据国家卫生健康委官网消息截至2021年5月23日31个省（区、市）及新疆生产建设兵团累计报告...  \n",
       "3     如今孩子们的玩具种类可以说是五花八门近年，一种叫“假水”的玩具在中小学生之间迅速走红▼甚至不...  \n",
       "4     南都记者获悉，2021年5月24日，第三方检测单位在广州广园路下塘西立交桥桥墩开展日常检测时...  \n",
       "...                                                 ...  \n",
       "2366  去年立的Flag都实现了吗？牛年又想立下什么Flag呢？新年的好意头都在福盒里啦点击下图问号...  \n",
       "2367  据云南沧源佤族自治县人民政府新闻办公室消息，2月14日17时40分，该县勐角民族乡翁丁村老寨...  \n",
       "2368  人体胚胎含有DNA遗传物质具有生命属性，不是民法上的一般物，在医疗服务合同目的落空后，当事人...  \n",
       "2369  由于天气持续温暖，今年入春不久的潮州，迎来了韩江两岸木棉花的绽放。繁花满树，艳丽夺目，在古城...  \n",
       "2370  2月12日（大年初一），东莞市塘厦镇市民陈女士在垃圾填埋场，紧握着几名环卫工人的手，连声说着...  \n",
       "\n",
       "[2331 rows x 4 columns]"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The remaining rows are the non-duplicates\n",
    "df_url_out[~df_url_out.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "id": "1f4289fa",
   "metadata": {},
   "outputs": [],
   "source": [
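    "# Save the scraped content locally (data export).\n",
    "# Note: df_url_out is written as-is, so the duplicate rows found above are included.\n",
    "with pd.ExcelWriter('{公众号}公众号链接及文章内容.xlsx'.format(公众号=公众号), mode='w', engine=\"openpyxl\") as writer:\n",
    "    df_url_out.to_excel(writer, sheet_name=公众号)"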
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6rc1"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
